Call:
lm(formula = age_stop ~ mom_age, data = bf)
Coefficients:
(Intercept) mom_age
5.920 0.389
2024-09-17
# A tibble: 5 × 2
mom_age age_stop
<dbl> <dbl>
1 20 5
2 20 29
3 20 6
4 20 NA
5 20 12
Predicted age_stop = 5.92 + 0.389*20 = 13.7
# A tibble: 1 × 2
mom_age .fitted
<dbl> <dbl>
1 20 13.7
# A tibble: 4 × 5
.rownames mom_age age_stop .fitted .resid
<chr> <dbl> <dbl> <dbl> <dbl>
1 8 20 5 13.7 -8.70
2 40 20 29 13.7 15.3
3 44 20 6 13.7 -7.70
4 67 20 12 13.7 -1.70
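The fitted value and residuals in the tables above can be reproduced by hand. The following is a minimal sketch, not part of the original analysis: it hard-codes the rounded coefficients (5.920 and 0.389) and the four non-missing age_stop values shown for mothers aged 20, so it matches the tables only up to rounding. The names `b0`, `b1`, and `obs` are made up for this illustration.

```{r sketch-fitted-resid}
# Rounded coefficients copied from the lm output (assumed precise enough here)
b0 <- 5.920
b1 <- 0.389

# Observed age_stop values (weeks) for the four mothers aged 20 with data
obs <- c(5, 29, 6, 12)

# Fitted value: intercept plus slope times mother's age
fitted_20 <- b0 + b1 * 20

# Residual: observed minus fitted
resid_20 <- obs - fitted_20

round(fitted_20, 1)   # 13.7
round(resid_20, 1)    # -8.7 15.3 -7.7 -1.7
```

A negative residual means the infant stopped breast feeding earlier than the regression line predicts; a positive residual means later.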
---
data_dictionary: breast-feeding-preterm.csv
description: >
  This data comes from a research study done at Children's Mercy Hospital
  and St. Luke's Medical Center. This was a study of breast feeding in
  pre-term infants. Infants were randomized into either a treatment group
  (NG tube) or a control group (Bottle). Infants in the NG tube group
  were fed in the hospital via their nasogastric tube when the mother
  was not available for breast feeding. Infants in the bottle group
  received bottles when the mothers were not available. Both groups
  were monitored for six months after discharge from the hospital.
copyright:
  This data is in the public domain and is freely available for anyone
  to use. Acknowledgement of the source is appreciated but not required.
format:
  comma-delimited
varnames:
  first row of data
missing_value_code: -1
size:
  rows: 84
  columns: 28
vars:
  feed_typ:
    scale: categorical
    values:
      - Control
      - Treatment
  age_stop:
    label: Age at which infant stopped breast feeding
    scale: ratio
    range: non-negative real
    unit: weeks
  sepsis:
    label: Diagnosis of sepsis
    scale: categorical
    values:
      - No
      - Yes
  total_ab:
    label: Total number of apnea and bradycardia incidents
    scale: ratio
    range: non-negative integer
  del_type:
    label: Type of delivery
    scale: nominal
    values:
      1: Vaginal
      2: C-section
  mom_age:
    label: Mother's age
    scale: ratio
    range: positive integer
    unit: years
---
title: "Analysis of breast feeding study"
author: "Steve Simon"
format:
  html:
    embed-resources: true
date: 2024-09-11
---
This program reads data and fits various linear regression models on a breast feeding study in pre-term infants. Find more information in the [data dictionary][dd]. This code is placed in the public domain.
[dd]: https://raw.githubusercontent.com/pmean/datasets/master/breast-feeding-preterm.yaml
## Load the tidyverse library
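For most of your programs, you should load the tidyverse library. This program also loads the broom library, which provides the glance and tidy functions used later. The messages and warnings are suppressed.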
```{r setup}
#| message: false
#| warning: false
library(broom)
library(tidyverse)
```
## Read the data and view a brief summary
Use the read_csv function to read the data. With a large number of variables, you may choose to leave the col_types argument out. R will usually figure out which variables are numeric and which are strings.
Replace all the numeric codes of -1 with the missing value code (NA).
```{r read}
bf <- read_csv(
  file="../data/breast-feeding-preterm.csv",
  col_names=TRUE)
glimpse(bf)
```
## Convert -1 to NA
The code below only works because every single variable in the dataset is non-negative.
```{r convert-missing}
bf[bf==-1] <- NA
```
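To see why the non-negative condition matters, here is a small sketch on a made-up data frame (not the breast feeding data). The column `change` is hypothetical and holds a legitimate value of -1; the blanket replacement wipes it out along with the true missing value code. Replacing within a single column is one safer alternative, assuming you know which columns use -1 as a missing value code.

```{r sketch-missing-code}
# Hypothetical toy data: -1 is a missing code in weeks, a real value in change
toy <- data.frame(weeks = c(5, -1, 12), change = c(-1, 2, 0))

# Blanket replacement: every -1 in every column becomes NA
blanket <- toy
blanket[blanket == -1] <- NA

# Targeted replacement: only the weeks column is recoded
targeted <- toy
targeted$weeks[targeted$weeks == -1] <- NA

blanket    # the legitimate -1 in change is lost
targeted   # change is left untouched
```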
## Calculate statistics for mother's age
```{r mom-age}
bf |>
  summarize(
    mean_mom_age=mean(mom_age, na.rm=TRUE),
    sd_mom_age=sd(mom_age, na.rm=TRUE),
    min_mom_age=min(mom_age, na.rm=TRUE),
    max_mom_age=max(mom_age, na.rm=TRUE),
    n_missing=sum(is.na(mom_age))) |>
  data.frame()
```
This is a reasonable distribution of ages. If you saw mothers much younger than 16 years or much older than 44 years, that might raise some concerns about the data.
## Calculate statistics for age_stop
```{r age-stop}
bf |>
  summarize(
    mean_age_stop=mean(age_stop, na.rm=TRUE),
    sd_age_stop=sd(age_stop, na.rm=TRUE),
    min_age_stop=min(age_stop, na.rm=TRUE),
    max_age_stop=max(age_stop, na.rm=TRUE),
    n_missing=sum(is.na(age_stop))) |>
  data.frame()
```
The maximum value, 34 weeks, was a bit of a concern for me because this was a six-month study, which would imply a largest value of 24 to 26 weeks. But I was told that breast feeding duration included time in the hospital, which could easily be as long as 8 or 10 weeks for a pre-term infant.
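The na.rm=TRUE argument in the summaries above matters because age_stop has missing values. As a minimal sketch, the five age_stop values listed earlier for mothers aged 20 (one of them NA) show what happens with and without it. The vector name `age_stop_20` is made up for this illustration.

```{r sketch-na-rm}
# The five age_stop values shown for mothers aged 20, including one NA
age_stop_20 <- c(5, 29, 6, NA, 12)

mean(age_stop_20)                # NA: a single missing value poisons the mean
mean(age_stop_20, na.rm=TRUE)    # 13: mean of the four observed values
sum(is.na(age_stop_20))          # 1: count of missing values
```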
## Plot mother's age and age when breast feeding stopped
```{r plot}
bf |>
  ggplot(aes(mom_age, age_stop)) +
    geom_point() +
    xlab("Mother's age (years)") +
    ylab("When breast feeding ended (weeks)") +
    geom_smooth(method="lm", se=FALSE) +
    ggtitle("Plot produced by Steve Simon on 2024-09-17")
```
There is a weak positive relationship between mother's age and the infant's age when breast feeding stopped.
## Linear regression estimates for predicting age_stop
```{r linear-regression}
m1 <- lm(age_stop~mom_age, data=bf)
m1
```
The estimated average duration of breast feeding increases by 0.39 weeks for each increase of one year in the mother's age. The estimated average duration of breast feeding is 5.9 weeks for a mother of age zero. This is an extrapolation well beyond the range of the data.
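The slope and intercept interpretations can be checked with simple arithmetic. This sketch hard-codes the rounded coefficients from the output (5.920 and 0.389) rather than pulling them from m1, and `predict_weeks` is a helper invented for this illustration.

```{r sketch-slope-intercept}
# Rounded coefficients copied from the lm output
b0 <- 5.920
b1 <- 0.389

# Hypothetical helper: estimated average duration (weeks) at a given age
predict_weeks <- function(mom_age) b0 + b1 * mom_age

# A ten-year difference in mother's age corresponds to ten times the slope
predict_weeks(30) - predict_weeks(20)

# The intercept is the estimate at age zero, far outside any plausible age
predict_weeks(0)
```

The first result is about 3.89 weeks, which is ten times the slope. The second simply returns the intercept, which is why it has no practical meaning here.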
Analysis of Variance Table
Response: age_stop
Df Sum Sq Mean Sq F value Pr(>F)
mom_age 1 570.0 569.99 5.7531 0.01879 *
Residuals 80 7925.9 99.07
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] 0.06708949
## Analysis of variance table for age_stop
```{r anova}
anova(m1)
```
The F-ratio is large (5.75) and the p-value is small (0.019), so you would reject the null hypothesis and conclude that there is a linear relationship between mother's age and duration of breast feeding.
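If it helps to see where the F-ratio comes from, the sketch below rebuilds it from the sums of squares and degrees of freedom printed in the analysis of variance table. The values are typed in by hand, so the result agrees with the table only up to rounding. The variable names are made up for this illustration.

```{r sketch-f-ratio}
# Values copied from the analysis of variance table
ss_model <- 570.0
df_model <- 1
ss_resid <- 7925.9
df_resid <- 80

# F is the ratio of the two mean squares
ms_model <- ss_model / df_model
ms_resid <- ss_resid / df_resid
f_ratio <- ms_model / ms_resid

# The p-value is the upper tail area of the F distribution
p_value <- pf(f_ratio, df_model, df_resid, lower.tail=FALSE)

round(f_ratio, 2)
round(p_value, 3)
```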
## R-squared for age_stop
```{r r-squared}
glance(m1)$r.squared
```
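Although there is a statistically significant relationship between mother's age and duration of breast feeding, as shown above, this relationship is very weak. R-squared is 0.067, so mother's age accounts for only about 7% of the variation in duration of breast feeding.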
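R-squared can also be computed directly from the analysis of variance table as the model sum of squares divided by the total sum of squares. This is a sketch using the rounded values from the table, so it matches the glance result only to about four decimal places.

```{r sketch-r-squared}
# Sums of squares copied from the analysis of variance table
ss_model <- 570.0
ss_resid <- 7925.9

# Proportion of total variation explained by the regression
r_squared <- ss_model / (ss_model + ss_resid)
round(r_squared, 3)
```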
2.5 % 97.5 %
(Intercept) -3.19546976 15.035265
mom_age 0.06625878 0.711827
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 5.92 4.58 1.29 0.200
2 mom_age 0.389 0.162 2.40 0.0188
## Confidence interval for the slope
```{r ci-age-stop}
confint(m1)
```
The confidence interval includes only positive values, so we are 95% confident that the duration of breast feeding increases as the mother's age increases. The slope could be as small as 0.066 weeks per year of mother's age or as large as 0.71 weeks per year of mother's age. This is a very wide interval, indicating a large degree of uncertainty about the true value of the slope parameter.
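The interval for the slope is the estimate plus or minus a t percentile times the standard error. The sketch below uses the rounded estimate (0.389) and standard error (0.162) from the tidy output and the 80 residual degrees of freedom from the analysis of variance table, so it will differ slightly from confint in the third decimal place.

```{r sketch-ci-slope}
# Rounded values copied from the regression output
estimate <- 0.389
std_error <- 0.162
df_resid <- 80

# 97.5th percentile of the t distribution for a 95% two-sided interval
t_crit <- qt(0.975, df_resid)

ci_slope <- estimate + c(-1, 1) * t_crit * std_error
round(ci_slope, 2)
```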
## Alternate test for the slope parameter
```{r t-test-for-age-stop}
tidy(m1)
```
The t statistic testing the slope parameter is large (2.40) and the p-value is small (0.019), both indicating that you should reject the null hypothesis and conclude that there is a positive relationship between mother's age and duration of breast feeding.
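With a single predictor, the t test for the slope and the F test from the analysis of variance table are the same test: the square of the t statistic equals the F-ratio. Here is a rough check using the rounded estimate and standard error, so the match is approximate.

```{r sketch-t-vs-f}
# Rounded values copied from the regression output
estimate <- 0.389
std_error <- 0.162
df_resid <- 80

# t statistic and its two-sided p-value
t_stat <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df_resid)

round(t_stat, 2)
round(t_stat^2, 2)   # close to the F-ratio of 5.75, up to rounding
round(p_value, 3)
```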
---
title: "Directions for 5501-05 programming assignment"
author: "Steve Simon"
format:
  html:
    embed-resources: true
date: 2024-08-18
---
This code is placed in the public domain.
## Program template
- Download [simon-5501-05-bf.qmd][tem]
- Store it in your src folder
- Modify the file name
  - Use your last name instead of "simon"
- Modify the documentation header
  - Add your name to the author field
  - Optional: change the copyright statement
[tem]: https://github.com/pmean/classes/blob/master/biostats-1/05/src/simon-5501-05-bf.qmd
## Data
- Download [breast-feeding-preterm.csv][dat]
- Store it in your data folder
- Review the [data dictionary][dic].
[dat]: https://github.com/pmean/datasets/blob/master/breast-feeding-preterm.csv
[dic]: https://github.com/pmean/datasets/blob/master/breast-feeding-preterm.yaml
## Question 1
Calculate descriptive statistics for gestational age (mean, standard deviation, minimum, and maximum) and count the number of missing values. Interpret these results.
## Question 2
Calculate descriptive statistics for age at discharge from the birth hospital and count the number of missing values. Interpret these results.
## Question 3
Pre-term infants spend more time in the hospital than full-term infants. In fact, the earlier the baby is born, the longer the infant remains in the birth hospital. Draw a scatterplot to examine whether this pattern holds in this dataset. Consider age at discharge to be the outcome variable when deciding how to draw this scatterplot. Use the geom_smooth function to graph the regression line, but do not extend the line beyond the range of the data.
Be sure to use descriptive labels for the two axes, including units of measurement.
## Question 4
Use the lm function to compute the slope and intercept for the regression model predicting age at discharge using gestational age.
Interpret both the slope and the intercept and state whether the intercept represents an inappropriate extrapolation.
## Question 5
Calculate an analysis of variance table for this regression model.
Interpret the F ratio and the p-value. What hypothesis do these two statistics test?
## Question 6
Calculate R squared for this regression model. Interpret this value.